Optimizing Memory-Bound Numerical Kernels on GPU Hardware Accelerators
Authors
Abstract
Hardware accelerators are becoming ubiquitous in high performance scientific computing. They are capable of delivering an unprecedented level of concurrent execution contexts. High-level programming language extensions (e.g., CUDA) and profiling tools (e.g., PAPI-CUDA, CUDA Profiler) are paramount for improving productivity while effectively exploiting the underlying hardware. We present an optimized numerical kernel for computing the symmetric matrix-vector product (SYMV) on NVIDIA Fermi GPUs. Due to its inherently memory-bound nature, this kernel is critical in the tridiagonalization of a symmetric dense matrix, which is a preprocessing step for calculating the eigenpairs. Using a novel design that addresses the irregular memory accesses by hiding latency and increasing bandwidth, our preliminary asymptotic results show 3.5x and 2.5x speedups over the corresponding CUBLAS 4.0 kernel, and 7-8% and 30% improvements over the Matrix Algebra on GPU and Multicore Architectures (MAGMA) library, in single and double precision arithmetic, respectively.
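To make the memory-access problem concrete, the following is a minimal, unoptimized CUDA sketch of a lower-triangular SYMV (y = alpha*A*x + beta*y). The kernel name, the column-major lower-triangle storage assumption, and the one-thread-per-row mapping are illustrative choices, not the paper's design; the optimized kernel described above instead relies on blocking, shared memory, and latency hiding, which this sketch deliberately omits.

// Illustrative sketch only (NOT the optimized kernel from the paper).
// Assumes the lower triangle of a column-major n x n matrix A is stored.
// Each A element is read once but contributes to two dot products (its row
// and, by symmetry, its column), so the kernel is limited by memory traffic.
__global__ void symv_lower_naive(int n, double alpha,
                                 const double* A, int lda,
                                 const double* x, double beta, double* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per row
    if (i >= n) return;

    double sum = 0.0;

    // Lower part: A(i,j), j <= i.  For a fixed j, consecutive threads read
    // consecutive addresses, so these loads coalesce well.
    for (int j = 0; j <= i; ++j)
        sum += A[i + (size_t)j * lda] * x[j];

    // Upper part reconstructed by symmetry: A(i,j) = A(j,i), j > i.  For a
    // fixed j, consecutive threads now read with stride lda, producing the
    // irregular, poorly coalesced accesses the optimized design must hide.
    for (int j = i + 1; j < n; ++j)
        sum += A[j + (size_t)i * lda] * x[j];

    y[i] = alpha * sum + beta * y[i];
}

A host would launch it, for example, as symv_lower_naive<<<(n + 255) / 256, 256>>>(n, alpha, dA, n, dx, beta, dy); the point of the sketch is only to show why the second loop's strided reads make SYMV memory bound.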
Similar Articles
Optimizing Memory-Bound SYMV Kernel on GPU Hardware Accelerators
Hardware accelerators are becoming ubiquitous in high performance scientific computing. They are capable of delivering an unprecedented level of concurrent execution contexts. High-level programming language extensions (e.g., CUDA) and profiling tools (e.g., PAPI-CUDA, CUDA Profiler) are paramount for improving productivity while effectively exploiting the underlying hardware. We present an optimized n...
Operator-Level GPU-Accelerated Branch and Bound Algorithms
Branch-and-Bound (B&B) algorithms are well-known tree-based exploratory methods for solving NP-hard discrete optimization problems to optimality. The construction of the B&B tree and its exploration are performed using four operators: branching, bounding, selection and pruning. Such algorithms are irregular, which makes their parallel design and implementation on GPU accelerators challenging. Am...
Methods for compressible fluid simulation on GPUs using high-order finite differences
We focus on implementing and optimizing a sixth-order finite-difference solver for simulating compressible fluids on a GPU using third-order Runge-Kutta integration. Since graphics processing units perform well in data-parallel tasks, they are an attractive platform for fluid simulation. However, high-order stencil computation is memory-intensive with respect to both main memory and the ...
A Code Optimization Framework for Performance Portability of GPU Kernels onto Custom Accelerators
The shift toward parallel computing has resulted in a growing interest in computing systems with heterogeneous processing modules. Reconfigurable devices are often employed in such heterogeneous systems due to their low power and parallel processing benefits. An important issue in the programmability of these systems is the need for a single programming interface. Recent works have leveraged ...
An Extension of the StarSs Programming Model for Platforms with Multiple GPUs
While general-purpose homogeneous multi-core architectures are becoming ubiquitous, there are clear indications that, for a number of important applications, a better performance/power ratio can be attained using specialized hardware accelerators. These accelerators require specific SDKs or programming languages that are not always easy to program. Thus, the impact of the new programming paradi...